# README — Datasets for CausalBGM (ACIC 2018 & Twins)

This deposit provides the **two datasets** used in the *CausalBGM* paper:

* **ACIC_2018**: ACIC 2018 semi-synthetic benchmark (binary treatment)
* **Twins**: Twins-based semi-synthetic benchmark (continuous treatment)

The folder structure in this Dataverse deposit matches the paths used in our data loaders, with one exception: **Dataverse may store tabular files with a `.tab` extension** (see the file format note at the end of this README).

---

## 1) ACIC_2018 (ACIC 2018 semi-synthetic benchmark; binary treatment)

**Folder:** `ACIC_2018/`

This dataset comes from the semi-synthetic benchmark of the **2018 Atlantic Causal Inference Conference (ACIC) competition**. It includes:

* baseline **covariates** (`v`), read from `x.tab` (note that `x.tab` holds the covariates, not the treatment `x`)
* **treatment** (`x`, binary), stored as column `z`
* **factual outcome** (`y`), stored as column `y`
* multiple **semi-synthetic settings**, each identified by a unique setting ID (`ufid`)

### Key files

* `ACIC_2018/x.tab`
  Covariates indexed by `sample_id`.
* `ACIC_2018/scaling/factuals/<UFID>.tab`
  Factual outcomes for one semi-synthetic setting, indexed by `sample_id`.
* (Optional, if used) `ACIC_2018/scaling/counterfactuals/<UFID>_cf.tab`
  Counterfactual outcomes for evaluation.
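
To illustrate how these files fit together, here is a minimal sketch that joins covariates and factual outcomes on `sample_id` with pandas. It is not the loader used by `Semi_acic_sampler`. It assumes tab-delimited files with a header row, that the factuals file carries the `z` and `y` columns, and that a factuals file may cover only a subset of the samples in `x.tab`. The helper name `load_setting`, the covariate names `v1` and `v2`, and all values are made up for the demo.

```python
import io
import pandas as pd

# Tiny in-memory stand-ins for x.tab and factuals/<UFID>.tab (made-up values).
X_TAB = "sample_id\tv1\tv2\n1\t0.5\t3\n2\t1.2\t0\n3\t-0.7\t1\n"
FACTUAL_TAB = "sample_id\tz\ty\n1\t1\t2.5\n3\t0\t-1.0\n"


def load_setting(covariates_file, factuals_file, sep="\t"):
    """Join covariates and factual outcomes on sample_id (hypothetical helper)."""
    covariates = pd.read_csv(covariates_file, sep=sep, index_col="sample_id")
    factuals = pd.read_csv(factuals_file, sep=sep, index_col="sample_id")
    # Inner join keeps only the samples present in this setting's factuals file.
    merged = factuals.join(covariates, how="inner")
    x = merged["z"].to_numpy()                      # binary treatment
    y = merged["y"].to_numpy()                      # factual outcome
    v = merged.drop(columns=["z", "y"]).to_numpy()  # covariates
    return x, y, v


x, y, v = load_setting(io.StringIO(X_TAB), io.StringIO(FACTUAL_TAB))
print(x.shape, y.shape, v.shape)
```

For real files, pass the paths to `x.tab` and `scaling/factuals/<UFID>.tab` instead of the `io.StringIO` objects, and set `sep=","` if your copies are comma-separated.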

### Minimal loading example

```python
from CausalBGM import Semi_acic_sampler

ds = Semi_acic_sampler(
    path="../data/ACIC_2018",  # folder that contains x.tab and scaling/
    ufid="d5bd8e4814904c58a79d7cdcd7c2a1bb",  # setting ID (UFID) of the factuals file
)
```

---

## 2) Twins (Twins semi-synthetic benchmark; continuous treatment)

**Folder:** `Twins/`

This benchmark is constructed from a large twins cohort and is widely used in causal inference evaluation. In our setup:

* the **treatment** `x` is a **continuous** variable derived from twins’ birth weights (converted to kg),
* baseline **covariates** `v` come from the `X` file,
* the **outcome** `y` is generated semi-synthetically in the sampler (with a known functional form plus noise), using the loaded covariates and treatment.

### Key files

* `Twins/twin_pairs_X_3years_samesex.csv`
  Covariates for twin pairs (after removing ID columns in preprocessing).
* `Twins/twin_pairs_T_3years_samesex.csv`
  Treatment-related variables (birth weights for the two twins).
* `Twins/twin_pairs_Y_3years_samesex.csv`
  Outcome-related variables, included for completeness (the outcome `y` used in our setup is generated in the sampler).
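
As a rough illustration of the unit conversion mentioned above, the sketch below reads a treatment-style table and converts birth weights to kilograms. It makes several assumptions that this README does not confirm: that the raw weights are in grams, that the first column is a row ID, and that the weight columns are named `dbirwt_0` and `dbirwt_1`. The helper name `birth_weights_kg` and the values are hypothetical. How `Semi_Twins_sampler` turns the two weights into the single treatment `x` is not shown here.

```python
import io
import pandas as pd

# Made-up stand-in for twin_pairs_T_3years_samesex.csv.
# The column layout (unnamed ID column, dbirwt_0, dbirwt_1) is an assumption.
T_CSV = ",dbirwt_0,dbirwt_1\n0,2268.0,2296.0\n1,1790.0,1850.0\n"


def birth_weights_kg(t_file):
    """Read per-twin birth weights (assumed grams) and return them in kilograms."""
    # index_col=0 treats the first column as a row ID, so it is not used as data.
    weights_g = pd.read_csv(t_file, index_col=0)
    return weights_g / 1000.0


weights_kg = birth_weights_kg(io.StringIO(T_CSV))
print(weights_kg)
```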

### Minimal loading example

```python
from CausalBGM import Semi_Twins_sampler

ds = Semi_Twins_sampler(
    path="../data/Twins",  # folder that contains the twin_pairs_*.csv files
    batch_size=32,
    seed=0,
)
```

---

## File format note: `.csv` vs `.tab` in Dataverse

Dataverse may store some uploaded tabular files as **archival `.tab`** files.
If your local code expects `.csv` paths, you can either:

* **(Recommended)** keep the same folder structure and adjust your local loader/config to point to the `.tab` files, or
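* rename downloaded `.tab` files to `.csv` (only if you know your downstream scripts require `.csv` filenames). Archival `.tab` files are typically tab-delimited, and renaming does not change the delimiter, so check that your reader uses the right separator.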

The data content is unchanged; the `.tab` extension only reflects how Dataverse stores and exports tabular files.
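
If you want local code to accept either extension, one option is to pick the separator from the file name. The sketch below is a hypothetical helper (`read_table` is not part of CausalBGM) and assumes that `.tab` files are tab-delimited with a header row and that every other extension is comma-separated. The demo writes the same made-up table in both formats and checks that they load identically.

```python
import os
import tempfile
import pandas as pd


def read_table(path):
    """Read a delimited table, choosing the separator from the file extension."""
    ext = os.path.splitext(path)[1].lower()
    sep = "\t" if ext == ".tab" else ","
    return pd.read_csv(path, sep=sep)


# Demo: two temporary files holding the same made-up table.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "demo.csv")
    tab_path = os.path.join(tmp, "demo.tab")
    with open(csv_path, "w") as f:
        f.write("sample_id,z,y\n1,1,2.5\n2,0,-1.0\n")
    with open(tab_path, "w") as f:
        f.write("sample_id\tz\ty\n1\t1\t2.5\n2\t0\t-1.0\n")
    from_csv = read_table(csv_path)
    from_tab = read_table(tab_path)

# Both versions should yield the same DataFrame.
print(from_csv.equals(from_tab))
```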

---

## Contact

If you have any problems using the data, please contact me at qiao.liu@yale.edu.
